Computer‐extracted features of nuclear morphology in hematoxylin and eosin images distinguish stage II and IV colon tumors

Abstract We assessed the utility of quantitative features of colon cancer nuclei, extracted from digitized hematoxylin and eosin‐stained whole slide images (WSIs), to distinguish between stage II and stage IV colon cancers. Our discovery cohort comprised 100 stage II and stage IV colon cancer cases sourced from the University Hospitals Cleveland Medical Center (UHCMC). We performed initial (independent) model validation on 51 (143) stage II and 79 (54) stage IV colon cancer cases from UHCMC (The Cancer Genome Atlas's Colon Adenocarcinoma, TCGA‐COAD, cohort). Our approach comprised the following steps: (1) a fully convolutional deep neural network with VGG‐18 architecture was trained to locate cancer on WSIs; (2) another deep‐learning model based on Mask‐RCNN with Resnet‐50 architecture was used to segment all nuclei from within the identified cancer region; (3) a total of 26 641 quantitative morphometric features pertaining to nuclear shape, size, and texture were extracted from within and outside tumor nuclei; (4) a random forest classifier was trained to distinguish between stage II and stage IV colon cancers using the five most discriminatory features selected by the Wilcoxon rank‐sum test. Our trained classifier using these top five features yielded an AUC of 0.81 and 0.78, respectively, on the held‐out cases in the UHCMC and TCGA validation sets. For 197 TCGA‐COAD cases, the Cox proportional hazards model yielded a hazard ratio of 2.20 (95% CI 1.24–3.88) with a concordance index of 0.71, using only the top five features for risk stratification of overall survival. The Kaplan–Meier estimate also showed statistically significant separation between the low‐risk and high‐risk patients, with a log‐rank P value of 0.0097. Finally, unsupervised clustering of the top five features revealed that stage IV colon cancers with peritoneal spread were morphologically more similar to stage II colon cancers with no long‐term metastases than to stage IV colon cancers with hematogenous spread. © 2022 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.

manual annotations of around 20 000 nuclei in the training set and 10 000 nucleus annotations in the validation set and included WSI (at 40´) for seven organs obtained from 18 different hospitals to develop generalized nucleus segmentation algorithms [45]. The nucleus segmentation module used in the current project obtained an average aggregated Jaccard index (AJI) of 0.70 on the challenge's validation dataset, which is on par with the winning entry of the challenge [45].

Comparison of other feature families
Following feature extraction, all quantitative features obtained from the images in the UHCMC training set (100 stage II and 100 stage IV CCs) were used in the training phase to identify the most important features to distinguish between stage II and stage IV CC using a combination of feature selection algorithms (mRMR and WRST) and machine learning classifiers (LDA, QDA, SVM, and random forests). Supplementary material, Table   S3 shows that the combination of Wilcoxon rank-sum test (WRST) for feature selection and random forest classifier for classification using the selected features gives the best class separation, with an AUC of 0.81 on the UHCMC initial validation set.
After choosing the best-performing feature selection (Wilcoxon rank-sum test) and machine learning classifier (random forest), we analyzed the ability of each feature family to separately identify which features were the most relevant for stage II versus stage IV classification for independent validation of the UHCMC and TCGA validation sets.
Supplementary material, Table S4 shows that features pertaining to the nuclear size (area, perimeter, and major-axis length), entropy of the nuclear orientation, and local cellular diversity (variance of nuclear contrast) were the most important ones to distinguish between stage II and stage IV CC. The AUC-ROC curves for each feature family are shown in supplementary material, Figure S2. It is clear from supplementary material, Table   S4 and Figure S2 that nuclear shape features, orientation, and the local cellular diversity yielded the best AUC-ROC for independent validation on both the UHCMC and the TCGA validation sets.

Feature descriptions
1 -Basic shape/intensity/morphology features of nuclei: This family quantifies the firstorder statistics, including mean, median, standard derivation, kurtosis, range, and skewness of all nuclear shape, size, texture, or morphology within the image. For each patch, the pre-segmented nucleus area was estimated. After that, six first-order statistics, including mean, median, standard derivation, kurtosis, range, and skewness, were calculated using all nuclei data within the patch. An example is shown in supplementary material, Figure S3. The numbers represent the pixel intensity within each nucleus. The first-order statistics could thus be calculated and are shown in supplementary material, Figure S3B. An example of a Delaunay triangulation graph is shown in supplementary material, Figure   S4.  Figure S5. The H&E image was first converted to a grayscale image. Then the gray-level co-occurrence matrix (GLCM) was calculated using the pixels from each pre-segmented nucleus. The entropy of the GLCM was then calculated using the methods introduced in supplementary material, Table S5. 4 -Local cell cluster graph (CCG): This feature family captures the nuclear architecture in the local region by constructing a local cell graph. The cell graphs were form based on the proximity of nuclei. Grouping nuclei into local clusters will construct the local nuclear graph (LNG). With the nuclei clustered, interactions between nuclei could be better characterized, and the nuclear properties which could quantify tumor morphology could be extracted more efficiently.

-Nuclear
To construct the LNG, a set of n segmented nuclei (denoted as N), where , was used to create a graph = ( ! , ! ), where represents the vertices of the graph (essentially the nuclei centroids) and represents the set of edges connecting the nuclei within G. Construction of the LNG can be achieved by linking nearby nuclei based on vicinity criteria as follows: where and are two vertices/nucleus in the LNG, represents the Euclidean distance between the two nuclei, and parameter α was set to control the density of the graph. Intuitively, was defined as the probability of two nuclei being connected in a graph. Thus, this probability could also quantify the possibility one of these nuclei being grown from the other. Since this probability decreases with an increase in distance, we probabilistically define an edge set E, such that where is a parameter empirically set to be 0.2. With an exponent of −α for α ≥ 0, a larger value of α produces sparser graphs. As α approaches 0, the graphs become densely connected and approach a complete graph. To extract the features from different sizes of the CCG, the algorithm would iterate the parameter α from 0.38 to 0.52 with a step of 0.2.
An example is shown in supplementary material, Figure S6 to demonstrate the CCG with different α.
5 -Basic shape/intensity features of nuclei in the local region: This family quantifies the first-order statistics, including mean, median, standard derivation, kurtosis, range, and skewness of the basic shape/size/texture properties for nuclei in the local region defined by the local cell cluster graph (CCG).
6 -Local cell graph tensor (CGT): This feature family captures nuclear orientation's entropy in the local region defined by the CCG. The steps to generate a CCG have been described above. For a nucleus within the local graph, " , the nuclear orientation was calculated by first performing principal component analysis on the set of nuclei boundary points [ # , # ] to obtain the principal components = [ $ , % ], where $ describes the orientation of the cell.
An angle, ( " ) ∈ [0°, 180°], was then calculated using the first principal component $ . The orientation of cell " was then set to be ( " ) counterclockwise from the vector <1,0>. One example is shown in supplementary material, Figure S7.
The entropy was then calculated using the following equation: In terms of the total number of cell runs, the cell graph shown in supplementary material, Figure S8(c) is more complex than the one shown in supplementary material, Figure S8(b).
8 -Feature-driven local cell graph (FeDeG): FeDeG investigates the interaction between different graphs [66]. It takes nuclear morphologic features such as nuclear mean intensity into consideration during the construction of cell graphs. To calculate FeDeG features, six nuclear morphologic features that describe the nuclear shape, size, and texture were first calculated from the pre-segmented nuclei.
For each 4 , a corresponding mode " , where " was initialized to be equal to 4 . Then " 6 was reclusively updated based on the mean-shift vector ! ( " 6 ): " 67$ = " 6 + ! ( " 6 ), 1 ≤ ≤ ! ( " 6 ) is the difference between the weighted mean and the center of the kernel. At the end of iteration, each 4 would find a mode " , which was used to generate the FeDeG.
After that, a Q-dimensional feature space which includes the centroid location of nuclei in the image and Q − 2 of the nuclear morphologic features were created. Thus, a multivariate kernel was defined as follows: where k is the profile of a radially symmetric kernel; ℎ 9 and ℎ : are the spatial and the nuclear morphologic component, respectively; C is the normalization constant; and ℎ 9 and ℎ : are the kernel bandwidths. A higher bandwidth would result in more neighboring data points being used to estimate the density in the Q-D feature space.
Based on the FeDeG created, four groups of features were derived to measure the interaction between FeDeGs, the size of the FeDeG, the intrinsic nuclear morphologic variation within each FeDeG, and the spatial arrangement of FeDeG.
9 -Local cellular diversity features: To quantify the morphology of nuclei, 11 nuclear morphologic features were calculated (listed in supplementary material, Table S5) using the segmented nuclei mask. For each nucleus , we have a set of nuclear features, = { , ( " ), ∈ {1 … 11}}. After that, an LNG was constructed using the methods described above. The co-occurrence nuclear morphology matrix (CM) was constructed for each LNG to quantify the local cellular diversity between nuclear sub-groups. If all the nuclei were identical in appearance, the co-occurrence matrix was a 1 ´ 1 matrix. On the other hand, the greater the diversity and range of features, the larger the co-occurrence matrix. Each i h element of the co-occurrence matrices records the frequency of a co-occurring nuclear subgroup.
In order to compute the co-occurrence matrix, the nuclear morphological features were discretized along each feature dimension, such that where ω is a quantifying factor that divides nuclei into ω sub-classes. For instance, with ω=3, nuclear size features could be categorized into three sub-classes: nuclei with large, medium, and small size.
For an image with a total number of q LNG graphs, = , ∈ {1 … }, all nuclei within each graph would be used to generate a corresponding CM in conjunction with a particular nuclear feature. A set of 120 basic shape features was used in this research. The constructed CM had a shape of ω by ω and was denoted as : ( ( , ) could in turn be expressed as follows: After the CM was constructed, higher-order statistics (Haralick measurements) were extracted from each LNG. Five common statistics, including median, standard derivation, range, kurtosis, and skewness, were calculated using the Haralick measurements of each LNG. The flowchart of extracting local cellular diversity features is shown in supplementary material, Figure S9. Figure S1. AUC-ROC curves for UHCMC and TCGA validation sets for the top five most significant discriminatory features for classification between stage II and stage IV CC with hematogenous metastases using a random forest classifier. Figure S2. AUC-ROC curves for CMC and TCGA validation sets for each feature using a Wilcoxon rank-sum test for feature selection and random forest as the classification algorithm.     The original H&E-stained image with pre-segmented nuclear contours and corresponding local cell graphs. Two typical local cell graphs (both 6-cell cliques) are shown in b and c, respectively. In c, the cell graph is decomposed into separate cliques comprising one 5-run and one 6-run. Figure S9. Flowchart for cellular diversity computation. (a) Green contours/lines indicate the nuclear boundaries and the adjacent panel shows the local nuclei graphs (LNGs) with edges between proximally located nuclei. (b) Shape, size, and texture features were then extracted for each of the nodes in the LNG and co-occurrence matrices were constructed. From these cooccurrence matrices, second-order statistics such as entropy were extracted and used to construct the cellular diversity feature vector S.

Supplementary Figures S1-S9
Supplementary Tables S1-S5     Table S5. Thirteen Haralick measurements of the co-occurrence matrix (CM). The 'intensity' refers to the quantification levels of pixel intensity here, for example.
",, represents the element in CM, where i and j represent the indices of the quantification level. ? @ are the partial probability density functions. ?7@ is the probability of CM coordinates summing to x + y, respectively. Ng is the quantification level.

Entropy
Measure